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Abstract 

Modeling and predicting co-occurrences of events is a fundamental problem of unsupervised learning. In 
this contribution we develop a statistical framework for analyzing co-occurrence data in a general setting 
where elementary observations are joint occurrences of pairs of abstract objects from two finite sets. The 
main challenge for statistical models in this context is to overcome the inherent data sparseness and to 
estimate the probabilities for pairs which were rarely observed or even unobserved in a given sample 
set. Moreover, it is often of considerable interest to extract grouping structure or to find a hierarchical 
data organization. A novel family of mixture models is proposed which explain the observed data by a 
finite number of shared aspects or clusters. This provides a common framework for statistical inference 
and structure discovery and also includes several recently proposed models as special cases. Adopting 
the maximum likelihood principle, EM algorithms are derived to fit the model parameters. We develop 
improved versions of EM which largely avoid overfitting problems and overcome the inherent locality of 
EM-based optimization. Among the broad variety of possible applications, e.g., in information retrieval, 
natural language processing, data mining, and computer vision, we have chosen document retrieval, the 
statistical analysis of noun/adjective co-occurrence and the unsupervised segmentation of textured images 
to test and evaluate the proposed algorithms. 



Copyright © Massachusetts Institute of Technology, 1998 

This report describes research accomplished at the Center for Biological and Computational Learning, the Artificial Intelligence 
Laboratory of the Massachusetts Institute of Technology, the University of Bonn and the beaches of Cape Cod. Thomas 
Hofmann was supported by a M.I.T. Faculty Sponser's Discretionary Fund. Jan Puzicha was supported by the German 
Research Foundation (DFG) under grant # BU 914/3-1. 



1 Introduction 

The ultimate goal of statistical modeling is to explain 
observed data with a probabilistic model. In order to 
serve as a useful explanation, the model should reduce 
the complexity of the raw data and has to offer a certain 
degree of simplification. In this sense statistical modeling 
is related to the information theoretic concept of min- 
imum description length [47, 48]. A model is a 'good' 
explanation for the given data if encoding the model 
and describing the data conditioned on that model yields 
a significant reduction in encoding complexity as com- 
pared to a 'direct' encoding of the data. 

Complexity considerations are in particular relevant 
for the type of data investigated in this paper which is 
best described by the term co-occurrence data (COD) 
[30, 9]. The general setting is as follows: Suppose two 
finite sets X = {x\, . . . , x^} and y = {j/i, . . . , j/m} of 
abstract objects with arbitrary labeling are given. As el- 
ementary observations we consider pairs (xi, yf) £ X xy, 
i.e., a joint occurrence of object X{ with object j/j . All 
data is numbered and collected in a sample set S = 
{( x i(r), Ujir), r ) '■ I < r < L} with arbitrary ordering. 
The information in S is completely characterized by its 
sufficient statistics riij = \{(xi,yj , r) £ S}\ which mea- 
sure the frequency of co-occurrence of X{ and j/j . An im- 
portant special case of COD are histogram data, where 
each object X{ £ X is characterized by a distribution 
of measured features j/j £ y . This becomes obvious 
by partitioning S into subsets S{ according to the X- 
component, then the sample set Si represent an empir- 
ical distribution rij\i over y, where rij\i = riij/rii and 

Hi = \Si\. 

Co-occurrence data is found in many distinctive ap- 
plication. An important example is information retrieval 
where X may correspond to a collection of documents 
and y to a set of keywords. Hence riij denotes the num- 
ber of occurrences of a word j/j in the (abstract of) doc- 
ument Xi. Or consider an application in computational 
linguistics, where the two sets correspond to words be- 
ing part of a binary syntactic structure such as verbs 
with direct objects or nouns with corresponding adjec- 
tives [22, 43, 10]. In computer vision, X may correspond 
to image locations and y to (discretized or categorical) 
feature values. The local histograms rij\i in an image 
neighborhood around Xi can then be utilized for a sub- 
sequent image segmentation [24]. Many more examples 
from data mining, molecular biology, preference analysis, 
etc. could be enumerated here to stress that analyzing 
co-occurrences of events is in fact a very general and 
fundamental problem of unsupervised learning. 

In this contribution a general statistical framework 
for COD is presented. At a first glance, it may seem 
that statistical models for COD are trivial. As a conse- 
quence of the arbitrariness of the object labeling, both 
sets only have a purely nominal scale without ordering 
properties, and the frequencies riij capture all we know 
about the data. However, the intrinsic problem of COD 
is that of data sparseness, also known as the zero fre- 
quency problem [19, 18, 30, 64]. When N and M are very 
large, a majority of pairs (xi, yf) only have a small prob- 
ability of occurring together in S. Most of the counts 



riij will thus typically be zero or at least significantly 
corrupted by sampling noise, an effect which is largely 
independent of the underlying probability distribution. 
If the normalized frequencies are used in predicting fu- 
ture events, a large number of co-occurrences is observed 
which are judged to be impossible based on the data S. 
The sparseness problem becomes still more urgent in the 
case of higher order COD where triplets, quadruples, etc 
are observed. 1 Even in the domain of natural language 
processing where large text corpora are available, one has 
rarely enough data to completely avoid this problem. 

Typical state-of-the-art techniques in natural lan- 
guage processing apply smoothing techniques to deal 
with zero frequencies of unobserved events. Prominent 
techniques are, for example, the back-off method [30] 
which makes use of simpler lower order models and model 
interpolation with held-out data [28, 27]. Another class 
of methods are similarity-based local smoothing tech- 
niques as, e.g., proposed by Essen and Steinbiss [14] and 
Dagan et al. [9, 10]. An empirical comparison of smooth- 
ing techniques can be found in [7]. 

In information retrieval, there have been essentially 
two proposals to overcome the sparseness problem. The 
first class of methods relies on the cluster hypothesis 
[62, 20] which suggests to make use of inter-document 
similarities in order to improve the retrieval perfor- 
mance. Since it is often prohibitive to compute all pair- 
wise similarities between documents these methods typ- 
ically rely on random comparisons or random fraction- 
ation [8]. The second approach focuses on the index 
terms to derive an improved feature representation of 
documents. The by far most popular technique in this 
category is Salton's Vector Space Model [51, 58, 52] of 
which different variants have been proposed with differ- 
ent word weighting schemes [53]. A more recent variant 
known as latent semantics indexing [12] performs a di- 
mension reduction by singular value decomposition. Re- 
lated methods of feature selection have been proposed 
for text categorization, e.g., the term strength criterion 
[66]. 

In contrast, we propose a model-based statistical ap- 
proach and present a family of finite mixture models 
[59, 35] as a way to deal with the data sparseness prob- 
lem. Since mixture or class-based models can also be 
combined with other models our goal is orthogonal to 
standard interpolation techniques. Mixture models have 
the advantage to provide a sound statistical foundation 
with the calculus of probability theory as a powerful in- 
ference mechanism. Compared to the unconstrained ta- 
ble count 'model', mixture models offer a controllable 
way to reduce the number of free model parameters. As 
we will show, this significantly improves statistical infer- 
ence and generalization to new data. The canonical way 
of complexity control is to vary the number of compo- 
nents in the mixture. Yet, we will introduce a different 
technique to avoid overfitting problems which relies on 
an annealed generalization of the classical EM algorithm 
[13]. As we will argue, annealed EM has some additional 
advantages making it an important tool for fitting mix- 



1 Word ra-gram models are examples of such higher order 
co-occurrence data. 



ture models. 

Moreover, mixture models are a natural framework 
for unifying statistical inference and clustering. This is 
particularly important, since one is often interested in 
discovering structure, typically represented by groups of 
similar objects as in pairwise data clustering [23]. The 
major advantage of clustering based on COD compared 
to similarity-based clustering is the fact that it does not 
require an external similarity measure, but exclusively 
relies on the objects occurrence statistics. Since the 
models are directly applicable to co-occurrence and his- 
togram data, the necessity for pairwise comparisons is 
avoided altogether. 

Probabilistic models for COD have recently been in- 
vestigated under the titles of class-based n-gram models 
[4], distributional clustering [43], and aggregate Markov 
models [54] in natural language processing. All three 
approaches are recovered as special cases in our COD 
framework and we will clarify the relation to our ap- 
proach in the following sections. In particular we dis- 
cuss the distributional clustering model which has been 
a major stimulus for our research in Section 3. 

The rest of the paper is organized as follows: Section 
2 introduces a mixture model which corresponds to a 
probabilistic grouping of object pairs. Section 3 then fo- 
cuses on clustering models in the strict sense, i.e., models 
which are based on partitioning either one set of objects 
(asymmetric models) or both sets simultaneously (sym- 
metric models). Section 4 presents a hierarchical model 
which combines clustering and abstraction. We discuss 
some improved variants of the standard EM algorithm 
in Section 5 and finally apply the derived algorithms to 
problems in information retrieval, natural language pro- 
cessing, and image segmentation in Section 6. 



2 Separable Mixture Models 

2.1 The Basic Model 

Following the maximum likelihood principle we first 
specify a parametric model which generates COD over 
X x y, and then try to identify the parameters which 
assign the highest probability to the observed data. The 
first model proposed is the Separable Mixture Model 
(SMM). Introducing K abstract classes C a the SMM gen- 
erates data according to the following scheme: 

1. choose an abstract class C a according to a distri- 
bution 7T„ 

2. select an object X{ £ X from a class-specific condi- 
tional distribution pn\ a 

3. select an object j/j £ y from a class-specific condi- 
tional distribution qj\ a 

Note that steps 2. and 3. can be carried out indepen- 
dently. Hence, X{ and j/j are conditionally independent 
given the class C a and the joint probability distribution 
of the SMM is a mixture of separable component distri- 



butions which can be parameterized by 2 
K K 

Pij = p ( x i> yi) = ^2 7r «- P ( a; »' %' l a ) = X] ^"PiWHW- (!) 

a = l a = l 

The number of independent parameters in the SMM is 
(N + M - 1)K - 1. Whenever K < min{A, M] this is 
significantly less than a complete table with N M entries. 
The complexity reduction is achieved by restricting the 
distribution to linear combinations of K separable com- 
ponent distributions. 

2.2 Fitting the SMM 

To optimally fit the SMM to given data S we have to 
maximize the log-likelihood 



N M 



K 
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(2) 



with respect to the model parameters 9 = (i:,p,q). To 
overcome the difficulties in maximizing a log of a sum, 
a set of unobserved variables is introduced and the cor- 
responding EM algorithm [13, 36] is derived. EM is a 
general iterative technique for maximum likelihood esti- 
mation, where each iteration is composed of two steps: 

• an Expectation (E) step for estimating the unob- 
served data or, more generally, averaging the com- 
plete data log-likelihood, 

• and a Maximization (M) step, which involves max- 
imization of the expected log-likelihood computed 
during the E-step in each iteration. 

The EM algorithm is known to increase the likelihood 
in every step and converges to a (local) maximum of C 
under mild assumptions, cf. [13, 35, 38, 36]. 

Denote by R ra an indicator variable to represent 
the unknown class C a from which the observation 
{xi(r),yj(r),r) £ S was generated. A set of indicator 
variables is summarized in a Boolean matrix R £ 1Z, 
where 

K = J R = (R ra ) : J2 Rra = 1 \ (3) 

denotes a space of Boolean assignment matrices. R effec- 
tively partitions the sample set S into K classes. Treat- 
ing R as additional unobserved data, the complete data 
log-likelihood is given by 
L K 

£° = ^^2 Rra ^Og IT a +logft( r )|„ + log ^( r )| a ) (4) 

r = l a = l 

and the estimation problems for 7r, p and q decouple for 
given R. 

The posterior probability of R ra for a given parameter 
estimate 6^> (E-step) is computed by exploiting Bayes' 
rule and is in general obtained by 

(R ra ) = P(R ra = l\6,S) 

ex P(S\9,R ra = l)P(R ra = l\9). (5) 



2 The joint probability model in (1) was the starting point 
for the distributional clustering algorithm in [43], however 
the authors have in fact restricted their investigations to the 
(asymmetric) clustering model (cf. Section 3). 



Thus for the SMM (R ra ) is given in each iteration by 
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The M-step is obtained by differentiation of (4) using 
(6) as an estimate for R ra and imposing the normaliza- 
tion constraints by the method of Lagrange multipliers. 
This yields a (normalized) summation over the respec- 
tive posterior probabilities 
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Iterating the E- and M-step, the parameters converge 
to a local maximum of the likelihood. Notice, that it is 
unnecessary to store all L ■ K posteriors, as the E- and 
M-step can be efficiently interleaved. 

To distinguish more clearly between the different 
models proposed in the sequel, a representation in terms 
of directed graphical models (belief networks) is utilized. 
In this formalism, random variables as well as parame- 
ters are represented as nodes in a directed acyclic graph 
(cf. [41, 32, 17] for the general semantics of graphical 
models). Nodes of observed quantities are shaded and a 
number of i.i.d. observations is represented by a frame 
with a number in the corner to indicate the number of 
observations (called a plate). A graphical representation 
of the SMM is given in Fig. 1. 




Figure 1: Graphical model representation for the sym- 
metric parameterization of the Separable Mixture Model 

(SMM). 



2.3 Asymmetric Formulation of the SMM 

Our specification of the data generation procedure and 
the joint probability distribution in (1) is symmetric in 
X and y and does not favor an interpretation of the ab- 
stract classes C a in terms of clusters of objects in X or 



y . The classes C a correspond to groups of pair occur- 
rences which we call aspects. As we will see in compari- 
son with the cluster-based approaches in Section 3 this is 
different from a 'hard' assignment of objects to clusters, 
but differs also from a probabilistic clustering of objects. 
Different observations involving the same X{ G X (or 
Dj G y) can be explained by different aspects and each 
objects has a particular distribution over aspects for its 
occurrences. This can be stressed by reparameterizing 
the SMM with the help of Bayes' rule 
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(10) 



The corresponding generative model, which is in fact 
equivalent to the SMM, is illustrated as a graphical 
model in Fig. 2. It generates data according to the fol- 
lowing scheme: 

1. select an object X{ G X with probability pi 

2. choose an abstract class C a according to an object- 
specific conditional distribution p a \i 

3. select an object j/j G y from a class-specific condi- 
tional distribution qj\ a 

The joint probability distribution of the SMM can thus 
be parameterized by 



Pij = P{xi,yj) = PiQj\i, Qj\i 



^2Pa\i'lj\a 



(11) 



Hence a specific conditional distribution qj\i defined on 
y is associated with each object X{, which can be under- 
stood as a linear combination of the prototypical condi- 
tional distributions qj\ a weighted with probabilities p a \i 
(cf. [43]). Notice, that although p a \i defines a proba- 
bilistic assignment of objects to classes, these probabili- 
ties are not induced by the uncertainty of a hidden class 
membership of object X{ as is typically the case in mix- 
ture models. In the special case of X = y the SMM 
is equivalent to the word clustering model of Saul and 
Pereira [54] which has been developed parallel to our 
work. Comparing the graphical models in Fig. 1 and 
Fig. 2 it is obvious that the reparametrization simply 
corresponds to an arc reversal. 

2.4 Interpreting the SMM in terms of Cross 
Entropy 

To achieve a better understanding of the SMM con- 
sider the following cross entropy (Kullback-Leibler di- 
vergence) D between the empirical conditional distribu- 
tion rij\i = riij/rii of j/j given X{ and the conditional qj\i 
implied by the model, 
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Note, that the first entropy term does not depend on 
the parameters. Let a(S) = ^ s - n; ^- rij\i log rij\i and 




Figure 2: Graphical model representation for the Sepa- 
rable Mixture Model (SMM) using the asymmetric rep- 
resentation. 



rewrite the observed data log-likelihood in (2) as 
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N 



D[nj\. 
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, 8 logR - a(S) . (13) 



PMM is the reduction of the number of conditional dis- 
tributions pi\ a and qj\ a which also reduces the model 
complexity. In the SMM we have conditional distribu- 
tions for each class C a , while the PMM imposes addi- 
tional constraints, pi\ a = Pi\p, \iv a = vp and qj\ a = qj\p, 
if n a = up. This is illustrated by the graphical model in 
Fig. 3. 




Figure 3: Graphical representation for the Product 
Space Mixture Model (PMM). 



Since the estimation of pi = ri{ can be carried out in- 
dependently, the remaining parameters are obtained by 
optimizing a sum over cross entropies between condi- 
tional distributions weighted with n 8 -, the frequency of 
appearance of X{ in S, i.e., by minimizing the cost func- 
tion 



N 



n = J2 n i D i n 



j\i\1j\i 



(14) 



Because of the symmetry of the SMM an equivalent de- 
composition is obtained by interchanging the role of the 
sets X and y . 

2.5 Product-Space Mixture Model 

The abstract classes C a of a fitted SMM correspond to as- 
pects of observations, i.e., pair co-occurrences, and can- 
not be directly interpreted as a probabilistic grouping in 
either data space. But often we may want to enforce a 
simultaneous interpretation in terms of groups or proba- 
bilistic clusters in both sets of objects, because this may 
reflect prior belief or is part of the task. This can be 
achieved by imposing additional structure on the set of 
labels to enforce a product decomposition of aspects 



{l,...,K} = {l,...,K x }x{l,...,K y } 



(15) 



Each element a is now uniquely identified with a multi- 
index {vajHa). The resulting Product-Space Mixture 
Model (PMM) has the joint distribution 
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(16) 



Here 7r„ = i^ Vaiia is the probability to generate an obser- 
vation from a specific pair combination of clusters from 
X and y. The difference between the SMM and the 



The PMM with Kx X -classes and Ky 3^-classes is 
thus a constrained SMM with K = KxKy abstract 
classes. The number of independent parameters in the 
PMM reduces to 

(KxKy-l) + Kx(N-i) + Ky(M-i) . (17) 

Whether this is an advantage over the unconstrained 
SMM depends on the specific data generating process. 
The only difference in the fitting procedure compared to 
the SMM occurs in the M-step by substituting 
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(19) 



The E-step has to be adapted with respect to the mod- 
ified labeling convention. 

3 Clustering Models 

The grouping structure inferred by the SMM corre- 
sponds to a probabilistic partitioning of the observation 
space X x y . Although the conditional probabilities p a \i 
and q a y can be interpreted as class membership proba- 
bilities of objects in X or y, they more precisely corre- 
spond to object-specific distributions over aspects. Yet, 
depending on the application at hand it might be more 
natural to assume a typically unknown, but nevertheless 
definitive assignment of objects to clusters, in particular 
when the main interest lies in extracting grouping struc- 
ture in X and/or y as is often the case in exploratory 
data analysis tasks. Models where each object is as- 
signed to exactly one cluster are referred to as clustering 
models in the strict sense and they should be treated as 
models in their own right. As we will demonstrate they 
have the further advantage to reduce the model com- 
plexity compared to the aspect-based SMM approach. 



3.1 Asymmetric Clustering Model 

A modification of the SMM leads over to the Asymmet- 
ric Clustering Model (ACM). In the original formulation 
of the data generation process for the SMM the assump- 
tion was made that each observation (xi, j/j) is generated 
from a class C a according to the class-specific distribu- 
tion pi | a qj | a or, equivalently, the conditional distribu- 
tion qj\i was a linear combination of probabilities qj\ a 
weighted according to the distribution p a \i- Now we re- 
strict this choice for each object X{ to a single class. This 
implies that all j/j occuring in observations (xi,yj) in- 
volving the same object X{ are assumed to be generated 
from an identical conditional distribution Qj\ a - Let us 
introduce an indicator variable I{ a for the class member- 
ship which allows us to specify a probability distribution 

by 



P(xi,yj\I,P,q) 
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(20) 



The ACM can be understood as a SMM with I{ a re- 
placing p a \i- The model introduces an asymmetry by 
clustering only one set of objects X , while fitting class 
conditional distributions for the second set y. Obvi- 
ously, we can interchange the role of X and y and may 
obtain two distinct models. 

For a sample set S the log-likelihood is given by 
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The maximum likelihood equations are 
pi = rii/L, 
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The class-conditional distributions qj\ a are linear super- 
positions of all empirical distributions of objects X{ in 
cluster C a - Eq. (24) is thus a simple centroid condi- 
tion, the distortion measure being the cross entropy or 
Kullback-Leibler divergence. In contrast to (13) where 
the maximum likelihood estimate for the SMM mini- 
mizes the cross entropy by fitting Pi\ a , the cross entropy 
serves as a distortion measure in the ACM. The update 
scheme to solve the likelihood equations is structurally 
very similar to the A'-means algorithm: calculate as- 
signments for given centroids according to the nearest 
neighbor rule and recalculate the centroid distributions 
in alternation. 

The ACM is in fact similar to the distributional clus- 
tering model proposed in [43] as the minimization of 
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In distributional clustering, the KL-divergence as a dis- 
tortion measure for distributions has been motivated by 



the fact that the centroid equation (24) is satisfied at 
stationary points 3 . Yet, since after dropping the inde- 
pendent pi parameters and a data dependent constant 
we arrive at 
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(26) 



showing that the choice of the KL-divergence simply fol- 
lows from the likelihood principle. We like to point out 
the non-negligible difference between the distributional 
clustering cost function in (25) and the likelihood in (26), 
which weights the object specific contributions by their 
empirical frequencies n 8 -. This implies that objects with 
large sample sets Si have a larger influence on the opti- 
mization of the data partitioning, since they account for 
more observations, as opposed to a constant influence in 
the distributional clustering model. 

3.2 EM algorithm for Probabilistic ACM 

Instead of interpreting the cluster memberships 7 8a as 
model parameters, we may also consider them as unob- 
served variables. In fact, this interpretation is consistent 
with other common mixture models [35, 59] and might 
be preferred in the context of statistical modeling, in 
particular if N scales with L. Therefore consider the 
following complete data distribution 

P(S,I\p,p,q) = P(S\I,p,q)P(I\p), (27) 
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Here p specifies a prior probability for the hidden vari- 
ables, P(Iia = l\p) = p a - 

The introduction of unobservable variables 7 8a yields 
an EM-scheme and replaces the arg min-evaluation in 
(23) with posterior probabilities 
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The M-step is equivalent to (24) with posteriors replac- 
ing Boolean variables. Finally, the additional mixing 
proportions are estimated in the M-step by 
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Notice the crucial difference compared to SMM poste- 
riors in (6): Since all indicator variables R ra belonging 
to the same x, are identified, the likelihoods for obser- 
vations in Si are collected in a product before they are 
suitably normalized. To illustrate the difference consider 
the following example. Let a fraction s of the observed 



3 This is in fact not a unique property of the KL- 
divergence as it is also satisfied for the Euclidean distance. 



pairs involving X{ be best explained by assigning X{ to 
C a while the remaining fraction 1 — s of the data is best 
explained by assigning it to C v . Fitting the SMM model 
approximately results in p a \i = s and p v \i = 1 — s, ir- 
respective of the number of observations, because pos- 
teriors (I ra ) are additively collected. In the ACM all 
contributions first enter a huge product. In particular, 
for rii — ► oo the posteriors (J; a ) approach Boolean val- 
ues and automatically result in a hard partitioning of 
X. Compared to the original distributional clustering 
model as proposed in [43] our maximum likelihood ap- 
proach naturally includes additional parameters for the 
mixing proportions p. 

Notice that in this model, to which we refer as prob- 
abilistic ACM, different observations involving the same 
object Xi are actually not independent, even if condi- 
tioned on the parameters (p,p,q). As a consequence, 
considering the predictive probability for an additional 
observation s = (xi,yj,L + 1) requires to condition on 
the given sample set S, more precisely on the subset Si, 
which yields 

K 
P(s\S, p, p,q) = ^2 P(s\I ia =l,p, q) P(I ia =l\Si,p, q) 

a = l 

K 

= Pi ^2<lj\a(Iia) ■ (32) 



Thus we have to keep the posterior probabilities in ad- 
dition to the parameters (p, q) in order to define a (pre- 
dictive) probability distribution on co-occurrence pairs. 
The corresponding graphical representation in Fig. 4 
stresses the fact that observations with identical X- 
objects are coupled by the hidden variable Ii a 4 . 




Figure 4: Graphical representation for the Asymmetric 
Clustering Model (ACM). 



4 Notice that n, is not interpreted as a random variable, 
but treated as a given quantity. 



3.3 Symmetric Clustering Model 

We can classify the models discussed so far w.r.t. the 
way they model p a \i'- (i) as an arbitrary probability dis- 
tribution (SMM), (ii) as an unobserved hidden variable 
Iia), probabilistic ACM), (iii) as a Boolean vari- 



x\i 



ha G {0,1}, hard ACM). On the other 
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hand, no model imposes restrictions on the conditional 
distributions Qj\ a - This introduces the asymmetry in 
ACM, where p a \i and hence pi\ a is actually restricted, 
while qj\ a is not. However, it also indicates a way to 
derive symmetric clustering models, namely by impos- 
ing the same constraints on conditional distributions pi\ a 
and Qj\ a . Unfortunately, naively modifying the SMM for 
both object sets simultaneously does not result in a rea- 
sonable model. Introducing indicator variables Ii a and 
Jj a to replace p a \i and q a \j yields a joint probability 
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(33) 



which can be normalized to yield a valid probability dis- 
tribution, but which is zero for pairs (xi,yj), whenever 
2~2 a ha Jja = and therefore results in zero probabilities 
for each co-occurrence of pairs not belonging to the same 
cluster C a . 

For the PMM, however, the corresponding clustering 
model is more interesting. Let us introduce cluster asso- 
ciation parameters c vfl and define the joint probability 
distribution of the symmetric clustering model (SCM) by 



Pij — PiQj / j Hv ■-> j n^v n 7 



(34) 



where we have to impose the global normalization con- 
straint 



N M N M 

/ j / j Pi] = Z_^ / j PiQj / j HvijiiCvfi 
i—lj—1 i—lj—1 v,n 



(35) 



In the sequel, we will add more constraints to break cer- 
tain invariances w.r.t. multiplicative constants, but for 
now no restrictions besides (35) are enforced. 

Introducing a Lagrange multiplier A results in the fol- 
lowing augmented log-likelihood 

£ = ^ fijj ^ I iv J jfi (log Pi + log Qj + log c vfi ) 

i,j v,ii 



+^ ^2 PiQj y^jiyjjycvn - 1 



(36) 
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The corresponding stationary equations for the maxi- 
mum likelihood estimators are 



Pi 
Qj 



^E,/,^ livCv^Yjj Qj-h 



jm 



•* Z^y,/j J iV- C vn L^iPiHl 



^N 



, M 



i = l 2^i j=l n ij 1 ivJjn 

M'EZiPiti^CE^jLiQjJjv)' 



(37) 
■-J2i n ij, (38) 

(39) 



together with (35) which is obtained from (36) by differ- 
entiation with respect to A. Substituting the right-hand 
side of (39) into (35) results in the following equation for 
A, 



The coupled K-means like equations for either set of 
discrete variables are obtained by maximizing the aug- 
mented likelihood in (36) from which we deduce 
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else 
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Inserting A = — L into (39) gives an explicit expression 

for c v)i which depends on p and q. Substituting this Tne expression for h iv further simplifies, be 



niYrij 

1 

L 



^-v fJ, 



expression back into (37,38) yields 
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(41) 



and an equivalent expression for qj . Observing that the 
fraction on the right-hand side in (41) does not depend 
on the specific index i, the self-consistency equations can 
be written as pi = ri{ ~Y^ V I{ v a v . It is straightforward to 
verify that: (i) any choice of constants a„ £ HI gives a 
valid solution and (ii) each such solution corresponds to 
a (local) maximum of C. This is because a simultaneous 
re-scaling of all pi for which I{ v = 1 is compensated by a 
reciprocal change in c vfl as is obvious from (34) or (39), 
leaving the joint probability unaffected. 

We break this scale invariance by imposing the ad- 
ditional conditions ^rIjv = ^% = ^2i n iliv I L and 
2 j IjJjp- = ""u — X2j m jJjf/L on the parameters. This 
has the advantage to result in the simple estimators 
pi = rii/L and qj = rrij/L, respectively. The proposed 
choice decouples the estimation of p and q from all other 
parameters. Moreover it supports an interpretation of pi 
and qj in terms of marginal probabilities, while tt x and 
tt ,j correspond to occurrence frequencies of objects from 
a particular cluster. With the above constraints the final 
expression for the maximum likelihood estimates for c vfl 
becomes 



--V fJ, 






N M 



where 



VV-^-J- J- 



(42) 
(43) 



TT vfi can be interpreted as an estimate of the joint proba- 
bility for the cluster pair [y, //). Notice that the auxiliary 
parameters are related by marginalization ^2 ^vp, = rf 
and ^2 v Tt vll = tt ,j . The maximum likelihood estimate 
c vfl is a quotient of the joint frequency of objects from 
clusters C x and C y and the product of the respective 
marginal cluster frequencies. If we treat c vfl as func- 
tions of I, J and insert (42) into (36), this results in 
a term which represents the average mutual informa- 
tion between the random events x, £ C x v and j/j £ C y . 
Maximizing C w.r.t. I and J thus maximizes the mutual 
information which is very satisfying, since it gives the 
SCM a precise interpretation in terms of an information 
theoretic concept. A similar criterion based on mutual 
information has been proposed by Brown et al. [4] in 
their class-based n-gram model. More precisely their 
model is a special case of the (hard clustering) SCM, 
where formally X = y and Ii v = J 8 '„. 5 
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is a constant independent of the cluster index v which 
can be dropped in the maximization in (44). It has to 
be stressed that the manipulations in (45) have made 
use of the fact that the J-variables appearing in (42) 
and (45) can be identified; the c-estimates have thus to 
be J-consistent 6 . This can be made more plausible by 
verifying that for given J and J-consistent parameters, 
the marginalization ^ . pij = pi holds independently of 
the specific choice of I. This automatically ensures the 
global normalization since ^ s - pi = 1 which in turn ex- 
plains why the global constraint does not affect the op- 
timal choices of I according to (44). Similar equations 
can be obtained for J, 
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1 if a 

else 



arg max,, £\ n^ Y, v hv log ( 



The nature of the simultaneous clustering suggests an al- 
ternating minimization procedure where the X -partition 
is optimized for a fixed 3^-partition and vice versa. Af- 
ter each update step for either partition the estimators 
c vfi have to be updated in order to ensure the validity of 

(45), i.e., the update sequence (I, c, J, c) is guaranteed 
to increase the likelihood in every step. Notice that the 
cluster association parameters effectively decouple the 
interactions between assignment variables Ii v , Ik v for 
different X{, x^ and between assignment variables J J/U , 
Jifi for different j/j , y\ . Although we could in principle 
insert the expression in (36) directly into the likelihood 
and derive a local maximization algorithm for I and J 
(cf. [4]), this would result in much more complicated 
stationary conditions than (44). The decoupling effect is 
even more important for the probabilistic SCM derived 
in the next section. 

3.4 Probabilistic SCM 

As for the ACM, we now investigate the approach pre- 
ferred in statistics and treat the discrete I and J as hid- 
den variables. The situation is essentially the same as 
for the ACM. The complete data distribution for the 
probabilistic SCM (PSCM) is given by 



P(S,I,J\p x ,py,c) 
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J For the bigram model in [4] this implies that the word 



classes are implicitly utilized in two different ways: as classes 
for predicting and for being predicted. It is not obvious that 
this is actually an unquestionable choice. 

6 Hereby we mean that c v[1 depends on J by the formula in 
(42), where I can be an arbitrary partitioning of the ,¥-space. 
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and the predictive probability for s 
involves joint posteriors, 



(xi,yj,L+ 1) now 



P(s\S,...) = Piqj22(Ii v Jjv)c vll . (47) 

The M-step equations are obtained from (42) by replac- 
ing (products of) Boolean variables by their posteriors. 
The estimates for the p-parameters are given by 



P : (t) 



Z-^i = lViv) 






Ef =1 <^) (t) 



(48) 



N ' M M 

The coupling of I and J makes the exact computa- 
tion of posteriors in the E-step intractable. To preserve 
tractability of the procedure we propose to apply a fac- 
torial approximation (called mean-field approximation, 
cf. Appendix A), (I^Jj^) ~ {Iiv){Jjn), which results in 
the following approximations for the marginal posterior 
probabilities 



(I iv ) oc p x v exp 



■^n ii ^(J J> )log ( 



(49) 



and a similar equation for (Jjfi). The mean-field equa- 
tions can be more intuitively understood as a soft-max 
versions of the hard clustering equations with additional 
priors p x , p y . Alternatively, one may also apply Markov 
chain Monte Carlo (MCMC) methods to approximate 
the required correlations. Yet, the mean-field approxi- 
mation has the advantage to be more efficient due to its 
deterministic nature. Notice that the mean-field condi- 
tions (49) form a highly non-linear, coupled system of 
equations. A solution is found by a fixed-point iteration 
which alternates the update of posterior marginals with 
an update of the continuous parameters ((/), (c, p x ), (J), 
(c,p y ) sequence). This optimizes a common objective 
function in every step and always maintains a valid prob- 
ability distribution. 7 

3.5 Overview 

Altogether we have derived six different model types for 
COD 8 which are summarized in the following systematic 
scheme in Table 1. As can be seen the models span a 
large range of model complexity by imposing constraints 
on the class conditional distributions pi\ a and Qj\ a . 

4 Hierarchical Clustering Model 

4.1 Model Specification 

In this section we present a novel hierarchical generative 
model, called HACM. Assume therefore a tree topology 



The EM approach for SCM has problems when started 
from a random initialization, because all posteriors typically 
approach uniform distributions. A remedy which we applied 
is to utilize a solution of the ACM to initialize one arbitrarily 
selected set of hidden variables. 

8 We ignore the variants obtained from asymmetric models 
by reversing the role of X and y for simplicity. 



Model 




Pi\a 

a = (v 


p) 


Qj\a 

a = (v, fi) 


SMM 


unconstr. 


unconstr. 


PMM 


Vi\v 


1i\» 


(hard) ACM 


IicPi 

7T a 


unconstr. 


(probabilistic) 


ACM 


{Iio,)pi 

7T a 


unconstr. 


(hard) SCM 


7r£ 




(probabilistic) 


SCM 


7r£ 




(Jiv)qj 



Table 1: Systematic overview of presented COD models. 



T on the clusters to be given, e.g., a complete binary 
tree. Clusters C a are identified with the terminal nodes 
of T. Conditional probabilities qj\ a are not only attached 
to the leaves, but also to all inner nodes of the tree. The 
HACM involves two stages. In the first step, which is 
similar to the ACM case, each object X{ is assigned to 
one component or cluster C a represented by an (unob- 
served) variable I{ a . Now, instead of generating all ri{ 
observations from the conditional distribution Qj\ a , a sec- 
ond probabilistic sampling step is involved by selecting a 
resolution or abstraction level. Hence, the first step is a 
selection from a horizontal mixture of possible paths for 
each object X{, while the second step is a probabilistic se- 
lection of a component from the vertical mixture of nodes 
on the selected path for each observation. To model the 
vertical selection we introduce a second set of hidden 
variables V rv which encode the resolution level A v for 
the r-th observation. Notice the different nature of both 
sets of variables: the I variables represent a partitioning 
in the X space similar to the ACM, while V partitions 
the co-occurrence space X x y like in the SMM. Obvi- 
ously, the hidden variables are not independent and have 
to fulfill the following set of constraints induced by T: 
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v rv = 1, Vr, 



(50) 



where A v \ C a denotes the nodes A v 'above' C a , i.e., 
nodes on the path to C a . The constraints ensure that 
once Xi is assigned to a cluster C a all observation in Si 
are restricted to be generated from one of the abstraction 
nodes A v on the path above C a . A pictorial represen- 
tation can be found in Fig. 5: for an object X{ assigned 
to C a the choices for abstraction levels of observations 
(xi,yj,r) are restricted to the 'active' (highlighted) ver- 
tical path. The 'tension' in this model is imposed by the 
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Figure 5: Scheme for data generation with the HACM. 



constraints on Ii( r ) a V rv as opposed to the independent 
choices R ra in the SMM. 

To complete the specification of the HACM one has 
to specify prior probabilities for V rv . The most general 
choice is to condition the priors on the terminal node 
selected by the I variables and on X{ itself, i.e., to intro- 
duce parameters T v \ a ^. This assumes that each object 
has a specific distribution over abstraction levels. For 
simplicity, the constraints in (50) are incorporated in 
the prior by setting T v \ a ^ = whenever A v J C a . With 
the above definitions the complete data log-likelihood of 
the HACM is given by 



+ ^2 Hi \ogpi + ^2 E Iia log pa ■ ( 51 ) 



The corresponding representation in terms of a graphical 
model is shown in Fig. 6. 

We may think of the HACM as a mixture model with 
a horizontal mixture of clusters and a vertical mixture 
of abstraction levels. Each horizontal component is a 
mixture of vertical components on the path to the root, 
vertical components being shared by different horizontal 
components according to the tree topology. The HACM 
is more general than the ACM in that it allows to use a 
different abstraction level for each observation. The class 
conditional probability gju is now modeled by a vertical 
mixture offering additional degrees of freedom. In fact 
the ACM is retrieved as a special case by setting T v \ a> i = 
6 va . On the other hand, the HACM is more constrained 
than the SMM since the mixing of component densities 
qj\ a is restricted to nodes on a single path through the 
tree. This restriction precisely expresses why we obtain a 
hierarchical clustering organization of objects in X and, 
simultaneously, an abstractive organization of objects in 
y . The dual organization is in particular interesting for 
information retrieval where the HACM can be used to 
cluster documents (= X ) in a hierarchical fashion and 
simultaneously assign abstraction levels to the different 
keywords (= y). 



4.2 EM Algorithm for the HACM 

Skipping the derivation of the maximum likelihood equa- 
tions for the hard clustering case of HACM, we directly 
continue with the probabilistic EM version. 

In the E-step we need to calculate posterior probabil- 
ities (Ii(r) a V rv ). Since the values of the clustering vari- 
ables Ii a restrict the admissible values of V rv , we may 
compute the joint posterior probabilities from the chain 
rule 

P{Ii(r)aVrv= l\S , 6) = P(V rv = 1 |<S,( r ), 8, Ii( r ) a = 1) 

P(2j (r)a =l|.* (r) ,0), (52) 

where 8 = (p,q,p,r) summarizes all continuous param- 
eters. The conditional posterior probabilities for V rv are 
given by 
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The marginal posteriors can by definition be computed 
from 
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which yields 
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completing the E-step. 

In the M-step all parameters are updated according 
to the following set of equations 
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Using the HACM in large-scale applications requires 
a closer investigation of the computational complexity 




Figure 6: Graphical representation for the Hierarchical 
Asymmetric Clustering Model (HACM). 

of the EM model fitting procedure. The major com- 
putational burden of the algorithm is the re-calculation 
of posterior probabilities {V rv \ a ) in the E-step. To ac- 
celerate the EM algorithm we exploit the fact that the 
hierarchical and abstractive organization does not criti- 
cally depend on the parameters T v \ a ^. By this we mean, 
that even setting T v \ a ^ = const, (for all A v \ C a ) will 
result in a reasonable model. In fact, to learn a;;- specific 
distribution of the vertical mixtures is more of a fine tun- 
ing which may improve the model performance once the 
essential structure has been identified. An intermediate 
choice which worked well in our experiments is to set 
T v \ U) i = T v \ a , i.e., to introduce priors which are shared 
by all objects X{ belonging to the same cluster C a . The 
simplified E-step has the advantage that {V rv \ a ) does not 
depend on xu r ). Moreover it reduces the model complex- 
ity and may prevent overfitting. In the simplified model, 
instead of computing posteriors for all L observations, it 
suffices to compute posterior probabilities for all M y— 
objects. This can result in a significant acceleration, for 
example, in natural language modeling where M would 
be the size of the vocabulary as compared to the size R 
of the corpus of word occurrences which typically differ 
by several orders of magnitude. For r v \ a = const, an 
additional speed-up results from a simplified propaga- 
tion of posteriors in the tree. In our simulation we have 
thus pursued a three-stage strategy, where the degrees 
of freedom are incrementally increased. 

Like for the ACM it is straightforward to check that 
the predictive probabilities for s = (xi,yj,L + 1) are 
given by 



P(s\S, 
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4.3 Hierarchies and Abstraction 

With other hierarchical mixture models proposed for 
supervised [3, 29] and unsupervised learning [37] the 
HACM shares the organization of clusters in a tree struc- 
ture. It extracts hierarchical relations between clusters, 
i.e., it breaks the permutation-symmetry of the cluster 
labeling. Even more important, however, it is capable 
to perform statistical abstraction. Observations which 
are common to all clusters in the subtree rooted at an 



inner node are preferably captured at the level of general- 
ity represented by that node. Observations being highly 
specific are 'explained' on the terminal level. Therefore 
inner nodes do not represent a coarser view on the data 
which could be obtained, e.g., by a weighted combina- 
tion of the distributions at successor nodes and as might 
be expected for a hierarchical model. 9 The vertical mix- 
tures perform a specialization in terms of the level of gen- 
erality most adequate for each of the observations. Thus 
the HACM incorporates a novel notion of hierarchical 
modeling, which differs from multiresolution approaches, 
but also from other hierarchical concepts of unsupervised 
learning (e.g. [11]). It offers several new possibilities in 
data analysis and information retrieval tasks like extract- 
ing resolution-dependent meaningful keywords for sub- 
collection of documents and gives a satisfying solution 
to the problem of cluster summarization (cf. [8]) since 
it explicitly finds the most characteristic terms for each 
(super-)cluster of documents. 

There are several additional problems which have to 
be solved to arrive at a complete algorithm for the 
HACM. The most important concerns the specification 
of a procedure to obtain the tree-topology T. Before 
explaining our heuristic we have to introduce the impor- 
tant concept of annealing. 

5 Improved EM Variants 

5.1 Annealed EM 

So far we have mainly focused on the modeling prob- 
lem of defining mixture models for COD. The standard 
model fitting procedure has been the EM algorithm and 
its hard clustering variants. We now discuss two im- 
portant problems which naturally occur in this context. 
The first problem is to avoid unfavorable local maxima 
of the log-likelihood. The second, even more important 
problem is to avoid overfitting, i.e., maximize the per- 
formance on unseen future data. The framework which 
allows us to improve the presented EM procedures in 
both aspects is known as deterministic annealing. 

Deterministic annealing has been applied to many 
clustering problems, including vectorial clustering [49, 
50, 5], pairwise clustering [23], and in the context of 
COD for distributional clustering [43]. The key idea is 
to introduce a temperature parameter T and to replace 
the minimization of a combinatorial objective function 
by a substitute known as the free energy. Details on 
this topic are given in Appendix A. Here, we present an- 
nealing methods without reference to statistical physics. 
Consider therefore the general case of maximum likeli- 
hood estimation by the EM algorithm. The E-step by 
definition computes a posterior average of the complete 
data log-likelihood which is maximized in the M-step. 
The annealed E-step at temperature T performs this av- 
erage w.r.t. a distribution which is obtained by generaliz- 
ing Bayes' formula such that the likelihood contribution 
is taken to the power of l/T. For T > 1 this amounts 
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In fact, it is also important to develop hierarchical gen- 
eralizations of the kind. However, we focus on the HACM 
which is more specific to COD. 



to increasing the effect of the prior which in turn will re- 
sult in a larger entropy of the (annealed) posteriors. For 
example, in the case of the ACM the annealed E-step 
generalizing (21) is given by 10 
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For (hard) clustering applications deterministic anneal- 
ing is utilized in its usual T —> limit. Although there 
is no guarantee that deterministic annealing in general 
finds the global minimum, many independent empirical 
studies indicate that the typical solutions obtained are 
often significantly better than the corresponding 'unan- 
nealed' optimization. This is due to the fact that an- 
nealing is a homotopy method, where the (expected) like- 
lihood as a cost function is smoothed for large T and is 
recovered in the limit T — *■ 1. 

In addition, for fixed T > 1 the annealed E-step per- 
forms a regularization based on entropy, because the pos- 
terior probabilities minimize the generalized free energy 
at T = 1 which balances expected costs and (relative) 
entropy [38] (cf. Appendix A). This is the reason why 
annealed EM not only reduces the sensitivity to local 
minima but also controls the effective model complexity. 
It thereby has the potential to improve the generaliza- 
tion for otherwise overfitting models. The advantages of 
deterministic annealing are investigated experimentally 
in Section 6. 

In addition, the annealed EM algorithm offers a way 
to generate tree topologies. As is known from adaptive 
vector quantization [49], starting at a high value of T 
and successively lowering T leads through a sequence of 
phase transitions. At each phase transition the effective 
number of distinguishable clusters grows until some max- 
imal number is reached or the annealing is stopped. This 
suggests a heuristic procedure where we start with a sin- 
gle cluster and recursively split clusters. In the course of 
the annealing one keeps track of the splits and uses the 
'phase diagram' as a tree topology T. Note that merely 
the tree topology is successively grown, while the data 
partition obtained at a specific temperature is regrouped 
and may drastically change during the annealing process. 

To summarize, annealed EM solves three problems at 
once: 

1. It avoids unfavorable local minima by applying a 
temperature based continuation method, as the 
modified likelihood becomes convex at high tem- 
perature, 



Eq. (60) differs from the original formula by Pereira et 
al. [43] in that it scales the temperature with the frequency n t 
and includes the mixing proportions. As pointed out before 
this is naturally obtained in the ML framework, while in the 
distributional clustering cost function (25) the weights n t are 
not considered. Canceling these weights may be a reasonable 
approach to limit the effect of frequently observed 'objects' x t 
on the organization of clusters. From a statistical viewpoint, 
however, the latter is implausible, because more observations 
should automatically sharpen the posterior distributions at a 
given temperature level T. 



2. it avoids overfitting by discounting the likelihood 
contribution in the E-step 

3. it offers a physically motivated heuristic to produce 
a meaningful tree topology (for the HACM). 

In all experiments and for all statistical models we have 
therefore utilized the annealed variant of EM. 

5.2 Predictive EM 

Another modification of the E-step to improve gener- 
alization is worth considering. For notational simplic- 
ity focus on the E-step in the SMM as given by (6). If 
the parameters are eliminated by substituting them with 
their current M-step estimators in terms of posteriors as 
given by (7-9), we arrive at 
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where i = i(r) and j = j(r). Eq. (61) reveals that in 
estimating (R ra ) in the E-step its old estimator actu- 
ally appear on the right hand side for u = r. This 
has the effect to systematically overestimate high pos- 
terior probabilities while small posteriors are underes- 
timated. This positive feedback on the posteriors may 
lead to substantial overfitting phenomena. To illustrate 
this problem consider the extreme case, where ri{ = 1, 
e.g., [xi, yi, r) £ S is the only observation with X{. Then 
the stationary condition for the E-step is given by 
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and hence qj\ a = J2a'( R ra')qj\a' whenever (Ri a ) > 
which is only fulfilled if (R ra ) = 1 (being a stable solu- 
tion for a = argmax„' Qj\ a ')- F° r sparse data the diago- 
nal contribution can thus be dominant and the positive 
feedback bears the risk of overfitting. 

In order to overcome these problems we propose a 
variant of EM which we refer to as predictive EM. The 
only modification is to exclude the r-th observation in 
recalculating posteriors (R ra ). The class membership of 
the r-th observation is predicted based on the remain- 
ing samples. 11 For the SMM this implies that diagonal 
contributions are excluded in (61). It is obvious that the 
proposed correction does only have a minor influence on 
the computational complexity of the fitting procedure. 
Despite the heuristic flavor caused by modifying an al- 
gorithmic step, the predictive EM can be motivated from 
strict optimization principles. Further details and con- 
vergence considerations are given in Appendix B. 

In conclusion we would like to stress the fact, that 
although the positive feedback occurs in other EM al- 
gorithms, it is most severe for COD models, because of 
the inherent sparseness problem. Furthermore, other er- 
ror measures like the squared error between a data vector 
and a cluster centroid are far less sensitive to this type of 
problems then is the cross entropy involving logarithms 
of small probabilities. 
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This may result in undefined posterior probabilities due 
to zeros, as in the above example. In this case we assume the 
posterior to be uniform. 



5.3 Accelerated EM 

EM algorithms have important advantages over 
gradient-based methods, however for many problems the 
convergence speed of EM may restrict its applicability 
to large data sets. A simple way to accelerate EM al- 
gorithms is by overrelaxation in the M-step. This has 
been discussed early in the context of mixture models 
[44, 45] and was 'rediscovered' more recently under the 
title of EM(rj) in [1]. We found this method useful in 
accelerating the fitting procedure for all discussed mod- 
els. Essentially the estimator for a generic parameter 6 
in the M-step is modified by 



(*+i) 



(1-7,) 



(*). 



«(* + !) 



(63) 



a(* + l) 



where 9 is the M-step estimate, i.e., r, = 1 is the 

usual M-step. Choosing 1 < r, < 2 still guarantees con- 
vergence, and typically r, ps 1.8 has been found to be 
a good choice to speed up convergence. In case that 
a constraint is violated after performing an overrelaxed 
M-step, the parameter set is projected back on the ad- 
missible parameter space. For an overview on more elab- 
orated acceleration methods for EM we refer to [36]. 

5.4 Multiscale Optimization 

Multiscale optimization [21, 46] is an approach for ac- 
celerating clustering algorithms whenever a topologi- 
cal structure exists on the object space(s). In image 
segmentation, for example, it is a natural assumption 
that adjacent image sites belong with high probabil- 
ity to the same cluster or image segment. This fact 
can be exploited to significantly accelerate the estima- 
tion process by maximizing over a suitable nested se- 
quence of variable subspaces in a coarse-to-fine manner. 
This is achieved by temporarily tying adjacent sites in 
a joint assignment variable. For notational convenience 
we again restrict the presentation to the ACM, while ex- 
tensions to the SCM and both probabilistic variants are 
straightforward . 1 2 

More formally a coarsening hierarchy for X is given by 
a nested sequence of equivalence relations J\A"> over X, 
where M^ C M (l+1 ^ and M^ = {(^,^) : x { G X}. 
In the context of image analysis these equivalence rela- 
tions typically correspond to multi-resolution pixel grids 
obtained by subsampling. The log-likelihood in (21) is 
minimized at coarsening level / by imposing constraints 
of the form I{ a = Ij a whenever (xi,Xj) G M">. For all 
models under consideration this effectively amounts to 
reducing the number of indicator functions to one set of 
variables for each equivalence class in M"\ while pre- 
serving the functional form of the likelihood, thus en- 
abling highly accelerated optimization at coarser levels. 
Once the maximization procedure at a resolution level / 
is converged, the optimization proceeds at the next level 
/ — 1 by prolongating the found solution in J\A"> to the 
subset defined by M"~ \ thus initializing the optimiza- 
tion at level / — 1 with the solution at level /. For the 
probabilistic version multiscale optimization amounts to 
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0.06 


231 


0.06 


179 


0.40 
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12 Multiscale optimization in its current form is not appli- 
cable to the SMM/PMM. 
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Table 2: Perplexity results for different models (SMM, 
ACM, HACM, SCM) on two data sets (CRAN: predict- 
ing words conditioned on documents, PENN: predicting 
nouns conditioned on adjectives) based on ten-fold cross 
validation (K x = K y = K for SCM). 



modifying the E-step of the EM algorithm by computing 
posteriors for the reduced sets of indicator functions. 

We like to emphasize that in contrast to most mul- 
tiresolution optimization schemes, multiscale optimiza- 
tion has the advantage to maximize the original log- 
likelihood at all resolution levels. It is only the set of 
hidden variables which is effectively reduced by impos- 
ing the constraints on the set of hidden variables MS>. 
We applied multiscale optimization in all image analysis 
experiments resulting in typical accelerations by factors 
10-100 compared to single-level optimization. 

6 Results 

6.1 Information Retrieval 

Information retrieval in large databases is one of the key 
topics in data mining. The problem is most severe in 
cases where the query cannot be formulated precisely, 
e.g., in natural language interfaces for documents or in 
image databases. Typically, one would like to obtain 
those entries which best match a given query according 
to some similarity measure. Yet, it is often difficult to 
reliably estimate similarities, because the query may not 
contain enough information, e.g., not all possibly rele- 
vant keywords might occur in a query for documents. 
Therefore, one often applies the cluster hypothesis [62]: 
if an entry is relevant to a query, similar entries may also 
be relevant to the query although they may not possess 
a high similarity to the query itself due to the small 
number of keywords. Clustering thus provides a way of 
pre-structuring a database for the purpose of improved 
information retrieval, cf. [63] for an overview. Both types 
of clustering approaches, for the set of documents as well 
as for the keywords, have been proposed in the litera- 
ture. The most frequently used methods in this context 
are linkage algorithms (single linkage, complete linkage, 
Wards method, cf. [26]), or hybrid combinations of ag- 
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Figure 7: (a) Generalization performance and (b) training likelihood for the annealed EM at different temperatures 
on the Cranfield collection. 
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Figure 8: (a) Training likelihood and (b) generalization performance for the (annealed) predictive EM variant at 
different temperatures on the Cranfield collection. 



glomerative and centroid-based methods [8] which have 
no probabilistic interpretation and have a number of 
other disadvantages. In contrast, COD mixture models 
provide a sound statistical basis and overcome the fun- 
damental sparseness problem of proximity-based clus- 
tering. In particular the hierarchical clustering model 
(HACM) has many additional features which make it a 
suitable candidate for interactive or coarse-to-fine infor- 
mation retrieval. 

We have performed experiments for information re- 
trieval on different collections of abstracts. The index 
terms for each dataset have been automatically extracted 
from all documents with the help of a standard word 
stemmer. Following [62], a list of stop words has been 
utilized to exclude frequently used words. Words with 
few overall occurrences have also been eliminated. The 
documents are identified with the set of objects X , while 
index terms correspond to y . 

In the first series of experiments we have investi- 
gated how well the different models perform in pre- 
dicting the occurrences of certain words in the con- 



text of a particular document. Therefore, the set of 
all word occurrences has been divided into a training 
and a test set. From a statistical point of view the 
canonical goodness-of-fit measure is the average log- 
likelihood on the test set. Yet, in the context of nat- 
ural language processing it is more customary to utilize 
the perplexity V which is related to the average test 
set log-likelihood / by V = exp(— /). Since we have 
used the annealed EM algorithm, a validation set was 
utilized to determine the optimal choice of the compu- 
tational temperature. Comparative results for all dis- 
cussed models 13 on the standard test collection Cran- 
field (CRAN, N=1400,M=1664,L=111803) are summa- 
rized in Table 2. For all experiments we have performed 
a ten-fold cross validation. The main conclusions are: 

• The lowest perplexity is obtained with the SMM. 
The HACM performs better than the more con- 
strained ACM and SCM. Hence, in terms of per- 
plexity the linear mixture models should be pre- 
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13 The performance of the PMM is comparable to the SMM 
and hence not displayed. 
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Figure 9: Upper part of a cluster hierarchy (6 levels) for the CLUSTER dataset generated by annealed EM for 
HACM. Each node is described by the index words yj with the highest probability Qj\ u . 



ferred over the clustering models. 

• The optimal temperature for the SMM is consis- 
tently below T = 1 which is the standard EM algo- 
rithm. For the clustering models the optimal gener- 
alization performance even requires a much higher 
temperature as expected. 

• Temperature-based complexity control clearly does 
much better than restricting the number K of com- 
ponents. Even the SMM with K = 8 components 
suffers from overfitting at T = 1. 

To stress the advantages of the proposed EM variants, 
we have investigated the effect of a temperature-based 
regularization in more detail. Fig. 7 shows log-likelihood 
curves for typical runs of the annealed EM algorithm at 
different temperatures. 14 The overfitting phenomenon is 
clearly visible, e.g., for the SMM at T = 1, where the 
test-set performance degrades after 30 iterations. No- 
tice, that annealing performs much better than early 
stopping. A comparison of the predictive EM vari- 
ant with the standard EM for the SMM is depicted in 
Fig. 8. This demonstrates that even the (presumably) 



These simulations have been performed on COD from 
the CRAN collection. Qualitatively similar results have been 
obtained for all other data sets. 
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slight modification to avoid positive feedback improves 
the test-set performance. The overrelaxed EM variant 
has also proven to be a valuable tool in our simulations 
with a typical acceleration by a factor 2 — 3. 

To facilitate the assessment of the extracted structure 
we have investigated a dataset of TV = 1584 documents 
containing abstracts of papers with clustering as a ti- 
tle word (CLUSTER). This data is presumably more 
amenable to an interpretation by the reader than are 
the standard text collections. The top-levels of a cluster 
hierarchy generated by HACM are visualized in Fig. 9. 

To demonstrate the ability of the HACM to identify 
abstraction levels in the hierarchy, we have visualized 
the distribution of the responsibility for observations in- 
volving the same index word yj for some particularly 
interesting examples in Fig. 10. The first tree for the 
word 'cluster' shows that, as expected, the occurrences 
of 'cluster' in documents are explained to be a common 
feature of all documents, hence most of the occurrences 
are assigned to the root. The word 'decision' is found 
on a level 3 node, indicating that it is a typical word for 
all algorithmically oriented documents assigned to nodes 
in the subtree, but e.g. not for the left branch of papers 
from physics and astronomy. The index term 'robust' oc- 
curs in two different meanings: first, it has a highly spe- 




Figure 10: Exemplary relative word distributions over nodes for the CLU STER dataset for the keywords 'cluster' 
'decision', 'glass', 'robust', 'segment', and 'channel'. 



cific meaning in the context of stability analysis ('plane', 
'perturb', 'eigenvalue', 'root', etc.) and a rather broad 
meaning in the sense of robust methods and algorithms. 
The word 'segment' occurs mainly in documents about 
computer vision and language processing, but it is used 
to a significant larger extend in the first field, 'glass' is 
a specific term in solid state physics, it thus is found on 
the lowest level of the hierarchy, 'channel' is again am- 
bivalent, it is used in the context of physics as well as in 
communication theory. The bimodal distribution clearly 
captures this fact. 

The same experiments have been carried out for a 
dataset of 1278 documents with abstracts from the jour- 
nals Neural Computation and Neural Networks (NN). 
The solution of a HACM with K = 32 clusters is vi- 
sualized in Fig. 11, each node is again described by the 
index words with the highest probability. 

These examples are only spotlights, but they demon- 
strate that the hierarchical organization obtained by the 
HACM is able to extract interesting structure from co- 
occurrence data. A detailed investigation of its full po- 
tential in the context of information retrieval is beyond 
the scope of this paper and will be pursued in future 
work. 

6.2 Computational Linguistics 

In computational linguistics, the statistical analysis of 
word co-occurrences in lexical structures like adjec- 
tive/noun or verb/direct object has recently received a 



considerable degree of attention [22, 43, 9, 10]. Potential 
applications of these methods are in word-sense disam- 
biguation, a problem which occurs in different linguis- 
tic tasks ranging from parsing and tagging to machine 
translation. 

The data we have utilized to test the different mod- 
els consists of adjective-noun pairs extracted from a 
tagged version of the Penn Treebank corpus (N = 6931, 
M = 4995, L = 55214) and the LOB corpus (N = 5548, 
M = 6275, L = 36723) 15 . Performance results on the 
Penn dataset are reported in the second half of Table 2. 
The results are qualitatively very similar to the ones ob- 
tained on the CRAN document collection, although this 
application is quite different from the one in information 
retrieval. 

A result for a simultaneous hard clustering of the LOB 
data with the SCM is reported in Fig. 12. The visual- 
ization of the TT vfi matrix reveals that many groups in 
either space are preferably combined with mainly one 
group in the complementary space. For example the ad- 
jective group 'holy', 'divine', 'human' has its occurrences 
almost exclusively with nouns from the cluster 'life', 'na- 
ture', 'being'. Some groups are very much indifferent 
with respect to the groups in the corresponding set, e.g., 
the adjective group headed by 'small', 'big', 'suitable'. ' 
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3 Singular and plural forms have been identified. 



network 0.102 
neural 0.052 
function 0.0314 
result 0.0311 
genei 0.028 



Level5 

Level4 
Level3 

Level2 

Level la 
Level lb 

LevelOaa 
LevelOab 

LevelOba 
LevelObb 



Figure 11: Upper part of a cluster hierarchy (6 levels) for the NN dataset generated by annealed EM for HACM. 
Each node is described by the index words j/j with the highest probability Qj\ v . 
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6.3 Unsupervised Texture Segmentation 

The unsupervised segmentation of textured images is one 
of the most challenging and still only partially solved 
problems in low level computer vision. Numerous ap- 
proaches to texture segmentation have been proposed 
over the past decades, most of which obey a two-stage 
scheme: 

1. A modeling stage: characteristic features are ex- 
tracted from the textured input image, which range 
from spatial frequencies [25, 24], MRF-models 
[33, 34], co-occurrence matrices [16] to fractal in- 
dices [6]. 

2. In the clustering stage features are grouped into 
homogeneous segments, where homogeneity of fea- 
tures has to be formalized by a mathematical no- 
tion of similarity. 

Most widely, features are interpreted as vectors in a Eu- 
clidean space [25, 33, 34, 65, 40, 6, 31] and a segmenta- 
tion is obtained by minimizing the A'-means criterion, 
which sums over the square distances between feature 
vectors and their assigned, group-specific prototype fea- 
ture vectors. A'-means clustering can be understood as 
a statistical mixture model with isotropic Gaussian class 
distributions. 
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Occasionally, the grouping process has been based on 
pairwise similarity measurements between image sites, 
where similarity is measured by a non-parametric sta- 
tistical test applied to the feature distribution of a 
surrounding neighborhood [16, 39, 24]. Agglomerative 
techniques [39] and, more rigorously, optimization ap- 
proaches [24, 57] have been developed and applied for 
the grouping of similarity data in the texture segmenta- 
tion context. Pairwise similarity clustering thus provides 
an indirect way to group (discrete) feature distributions 
without reducing information in a distribution to their 
mean. 

COD mixture models, especially the ACM model, for- 
malize the grouping of feature distribution in a more di- 
rect manner. In contrast to pairwise similarity cluster- 
ing, they offer a sound generative model for texture class 
description which can be utilized in subsequent process- 
ing stages like edge localization [56]. Furthermore, there 
is no need to compute a large matrix of pairwise simi- 
larity scores between image sites, which greatly reduces 
the overall processing time and memory requirements. 
Compared to the mixture of Gaussian model, ACM pro- 
vides significantly more flexibility in distribution model- 
ing. Especially in the texture segmentation application 
class features often exhibit a non-Gaussian, e.g., multi- 
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Figure 12: Clustering of LOB using the SCM (K x = K^ = 32) with a visualization of the ir,,^ matrix and a 
characterization of clusters by their most probable words. 



modal distribution, which is the main reason for the suc- 
cess of pairwise similarity clustering approaches and the 
ACM compared to standard A'-means. 

The ACM is applicable to any feature extraction pro- 
cess. As an example we exploit a Gabor filter image 
representation. More specifically, in all experiments we 
used the modulus of a bank of Gabor filters with 4 ori- 
entations at 3 scales with an octave spacing, resulting in 
a 12 dimensional feature vector associated to each image 
site. In our experiments the features were discretized 
separately for each dimension using 40 bins. Statistical 
independence of the feature channels has been assumed 
for simplicity. 

The feature generation process is then modeled as a 
co-occurrence of an image site X{ £ X and a measured 
Gabor feature occurrence in one channel j/j = f k 3 . £ y , 
where rj denotes the Gabor channel and kj denotes the 
index of the (discretized) feature. The sample set Si for 
site Xi consists of all Gabor responses in a window cen- 
tered at X{, where the size of the window is chosen pro- 
portional to the filter scale [25]. Hence each image loca- 
tion Xi is effectively characterized by 12 one-dimensional 
histograms over Gabor coefficients. 

We have applied the ACM-based texture segmenta- 
tion algorithm to a collection of textured images. Fig. 13 
shows exemplary results for images which were ran- 
domly generated from the Brodatz texture collection of 
micro-textures. Fig. 14 shows similar results for mix- 
tures of aerial images. A detailed benchmark study of 
this novel segmentation algorithm including comparisons 
with state-of-the-art techniques will appear in a forth- 
coming paper. 
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7 Conclusion 

As the main contribution of this paper a novel class of 
statistical models for the analysis of co-occurrence data 
has been proposed and evaluated. We have introduced 
and discussed several different models, not only by enu- 
merating them as alternative approaches, but by distin- 
guishing them from a systematic point of view. The 
criterion to differentiate between these models is the 
way hidden variables are introduced, which effectively 
imposes constraints on the component distributions of 
the mixture. Several recently proposed statistical mod- 
els have turned out to be special cases. All models have 
a sound statistical foundation in that they define a gen- 
erative distribution, and all of them can be fitted by an 
(approximate) EM algorithm. 

Which of these models is the method of choice for a 
given problem crucially depends on the modeling goal. 
As we have argued, it is often required to detect groups 
structure or hierarchical representations. In these situ- 
ations one may be willing to sacrifice some precision in 
terms of statistical accuracy (i.e., perplexity reduction) 
to extract the structure of interest. Within the proposed 
framework models have been derived to extract group 
structure on either one or both object spaces and to 
model hierarchical dependencies of clusters. We strongly 
believe the proposed framework is flexible enough to be 
adapted to many different tasks. The generality of the 
developed methods has been stressed by discussing their 
benefits in the context of a broad range of potential ap- 
plications. 

In addition to the modeling problem we have also ad- 
dressed computational issues, in particular focusing on 
improved variants of the basic EM algorithm. Most im- 
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Figure 13: (a), (c) Two mixture images containing 5 textures each, (b), (d) The image segmentation obtained based 
on the ACM. 



portantly, our experiments underline the possible advan- 
tages of the annealed version of EM, which is a fruitful 
combination of ideas and methods from statistics and 
statistical physics. 
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Appendix A 

First, we establish an important relationship between 
the log-likelihood and a quantity known as free en- 
ergy in statistical physics [38]. Consider the data log- 
likelihood C = log P(S\9, R) as a function of the discrete 
hidden states R over 1Z for fixed parameters, and let 
%{R;S,9) = — C define a cost function on the hidden 
variable space. Minimizing E [H(R; S, 9)] w.r.t. prob- 
ability distributions over 1Z subject to a constraint on 
the entropy yields a quantity which is known as the free 
energy in statistical physics. This is generalized to non- 
uniform priors by fixing the relative entropy with respect 
to a prior distribution tt(R). Introducing a Lagrange pa- 
rameter T we arrive at the following objective function 
for probability distributions over the discrete space TZ 



T T (P\S, 9, tt) = E P \H{I)} + TE P 



loj 



P{R) 



<R) 



(64) 



The solution of the minimization problem associated 
with the generalized free energy in (64) is the (tilted) 
Gibbs distribution 



P(R\S, 9, tt) 



w(R) exp 



7i(R; S , 



*(R)[p(s\e,R)Y- 



(65) 



For T = 1 this is exactly the posterior probability of 
R. The posterior thus minimizes Tt at T = 1. The 
annealed EM algorithm is the generalization defined by 
an arbitrary choice of (the temperature) T [60]. In the 



E-step for T > 1 this amounts to discounting the likeli- 
hood as compared to the prior by taking it to the 1/T- 
th power. The M-step performs a minimization over 
Ep \H(I; S, 9)] and therefore Tt{P\S, 9, tt) with respect 
to 6 for fixed P where P is a Gibbs distribution which 
does not necessarily correspond to the true posterior. 
Notice that convergence of annealed EM is guaranteed 
since Tt (but not necessarily the likelihood itself) is a 
Lyapunov function. 

If an exact calculation of the posterior in the E-step 
is intractable, typically because of higher order correla- 
tions between the hidden variables as in the SCM, the 
optimization problem in (64) is restricted to factorial dis- 
tributions P(R\p ra ) = Ylr Yia Pra" • We ma ke use of the 
more suggestive notation (R ra ) = Pra to stress that the 
variational parameters p ra can actually be thought of as 
an approximation of the posterior marginals. This vari- 
ational technique is known as mean-field approximation 
[42, 2] and has been successfully applied for optimization 
problems [61, 23], in computer vision [15, 67, 24], and for 
inference in graphical models [55]. 

In general, solutions of the mean-field approximation 
have to fulfill the stationary conditions 



(Rr 



ir ra exp 



[H(R;S, 9, R ra — 1)) 



(66) 
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where expectations are taken with respect to P{R\p ra ) 
[24]. Notice that expected costs appear in the exponent, 
however the expectation is taken with respect to all hid- 
den variables except R ra itself which is fixed. In order 
to obtain a convergent iteration procedure in the general 
case one has to replace the 'synchronous' E-step update 
by a 'sequential' update. In the SCM, the coupling be- 
tween hidden variables is restricted to pairs of variables 
!,„ and Jjp with n 8 j > which allows us to recompute 
all posteriors for the I variables for given posteriors J in 
one sweep, and vice versa. 

Appendix B 

In order to preserve strict optimization principles, the 
predictive E-step variant has to be implemented slightly 
more careful than naively eliminating diagonal contribu- 
tions. Inserting the corrected estimates of 9 for given 
R into the complete data log-likelihood function (2) we 
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Figure 14: (a), (c) Two mixture images each containing 
7 textures extracted from aerial images, (b), (d) The 
image segmentation obtained based on the ACM. 



obtain the following cost function for R 



L K 



%{R; S) — 2_^ 2_^ R ra h r 



(67) 



:log ^ R ua + log ^ R ua - log ^Ruc 



,(») = , (r) 



i(«)=i(r) 



U^T 



which we refer to as predictive likelihood. Minimizing 
the free energy corresponding to (67) in a mean-field 
approximation yields a direct contribution proportional 
to exp[— h ra /T] but also additional terms which result 
from the indirect effect R ra has on h sa for other vari- 
ables R sa . We omit the details of the derivation which 
is purely technical. As a consequence one has to utilize 
sequential update to guarantee convergence of predictive 
EM. For reasons of efficiency we have ignored the indirect 
effects in the computation of the posterior probabilities 
in our experiments which empirically turned out to work 
well. 
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